Midterm Project Exercise: HR ANALYTICS EMPLOYEE ATTRITION AND PERFORMANCE

BCon 147: special topics

Author

Christine Abejero

Published

October 25, 2024

1 Project overview

In this project, we will explore employee attrition and performance using the HR Analytics Employee Attrition & Performance dataset. The primary goal is to develop insights into the factors that contribute to employee attrition. By analyzing a range of factors, including demographic data, job satisfaction, work-life balance, and job role, we aim to help businesses identify key areas where they can improve employee retention.

2 Scenario

Imagine you are working as a data analyst for a mid-sized company that is experiencing high employee turnover, especially among high-performing employees. The company has been facing increased costs related to hiring and training new employees, and management is concerned about the negative impact on productivity and morale. The human resources (HR) team has collected historical employee data and now looks to you for actionable insights. They want to understand why employees are leaving and how to retain talent effectively.

Your task is to analyze the dataset and provide insights that will help HR prioritize retention strategies. These strategies could include interventions like revising compensation policies, improving job satisfaction, or focusing on work-life balance initiatives. The success of your analysis could lead to significant cost savings for the company and an increase in employee engagement and performance.

3 Understanding data source

The dataset used for this project provides information about employee demographics, performance metrics, and various satisfaction ratings. The dataset is particularly useful for exploring how factors such as job satisfaction, work-life balance, and training opportunities influence employee performance and attrition.

This dataset is well-suited for conducting in-depth analysis of employee performance and retention, enabling us to build predictive models that identify the key drivers of employee attrition. Additionally, we can assess the impact of various organizational factors, such as training and work-life balance, on both performance and retention outcomes.

## The datatable function from the DT package creates an HTML widget display of the dataset
## Install the DT package if it is not yet available in your R environment
readxl::read_excel("dataset/dataset-variable-description.xlsx") |> 
  DT::datatable()

4 Data wrangling and management

Libraries

Task: Load the necessary libraries

Before we start working on the dataset, we need to load the libraries that will be used for data wrangling, analysis, and visualization. Make sure to load the following libraries here. For packages that still need to be installed, you can use the install.packages function. Some packages will be installed later in this project, so make sure to install them as needed and load them here.
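A minimal sketch of one way to check which packages still need installing before loading them. The helper name `missing_packages` is hypothetical (it is not part of the original project), and the install line is left commented out so that rendering the document never triggers an installation.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages loaded in this project (stats ships with R and is omitted)
needed <- c("readr", "dplyr", "DT", "janitor", "ggplot2", "plotly", "GGally",
            "sjPlot", "gridExtra", "report", "ggstatsplot", "scales", "tidyr")

to_install <- missing_packages(needed)
to_install

# Run once in the console, not inside the rendered document:
# if (length(to_install) > 0) install.packages(to_install)
```

An empty result means every package is already available and the `library()` calls should succeed.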

# load all your libraries here
library(readr)
library(dplyr)
library(DT)
library(janitor)
library(ggplot2)
library(plotly)
library(GGally)
library(stats)
library(sjPlot)
library(gridExtra)
library(report)
library(ggstatsplot)
library(scales)
library(tidyr)

4.1 Data importation

Task 4.1. Merging dataset
  • Import the two datasets Employee.csv and PerformanceRating.csv. Save the Employee.csv as employee_dta and PerformanceRating.csv as perf_rating_dta.

  • Merge the two datasets using the left_join function from dplyr. Use the EmployeeID variable as the variable to join by. You may read more information about the left_join function here.

  • Save the merged dataset as hr_perf_dta and display the dataset using the datatable function from DT package.
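Because one employee can have several performance reviews, a left join on EmployeeID can return more rows than the employee table has. The sketch below uses small made-up tables to show this behavior. The values and the `ReviewID` and `ManagerRating` columns are illustrative assumptions, not taken from the real files.

```r
library(dplyr)

# Toy data: made-up values for illustration only
employee_toy <- tibble(
  EmployeeID = c("E1", "E2", "E3"),
  Department = c("Sales", "Technology", "Sales")
)
rating_toy <- tibble(
  EmployeeID    = c("E1", "E1", "E2"),
  ReviewID      = c("R1", "R2", "R3"),
  ManagerRating = c(3, 4, 5)
)

# Keep every employee; repeat the employee row once per matching review
merged_toy <- left_join(employee_toy, rating_toy, by = "EmployeeID")

nrow(employee_toy)  # 3 employees
nrow(merged_toy)    # 4 rows: E1 appears twice, E3 has NA ratings
merged_toy
```

Comparing `nrow()` before and after the join is a quick way to confirm whether the merged dataset is at the employee level or the review level.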

## import the two datasets here (paths are relative to the project folder, as in the read_excel call above)
employee_dta <- read_csv("dataset/Employee.csv")

perf_rating_dta <- read_csv("dataset/PerformanceRating.csv")

## merge employee_dta and perf_rating_dta using left_join function.
## save the merged dataset as hr_perf_dta
hr_perf_dta <- left_join(employee_dta, perf_rating_dta, by = "EmployeeID")

## Use the datatable from DT package to display the merged dataset
datatable(hr_perf_dta)

4.2 Data management

Task 4.2. Standardizing variable names
  • Using the clean_names function from janitor package, standardize the variable names by using the recommended naming of variables.

  • Save the renamed variables as hr_perf_dta to update the dataset.

## clean names using the janitor packages and save as hr_perf_dta
hr_perf_dta <- hr_perf_dta |>  
  clean_names()

## display the renamed hr_perf_dta using datatable function
datatable(hr_perf_dta)
Task 4.2. Recode data entries
  • Create a new variable cat_education wherein education is 1 = No formal education; 2 = High school; 3 = Bachelor; 4 = Masters; 5 = Doctorate. Use the case_when function to accomplish this task.

  • Similarly, create new variables cat_envi_sat, cat_job_sat, and cat_relation_sat for environment_satisfaction, job_satisfaction, and relationship_satisfaction, respectively. Re-code the values accordingly as 1 = Very dissatisfied; 2 = Dissatisfied; 3 = Neutral; 4 = Satisfied; and 5 = Very satisfied.

  • Create new variables cat_work_life_balance, cat_self_rating, cat_manager_rating for work_life_balance, self_rating, and manager_rating, respectively. Re-code accordingly as 1 = Unacceptable; 2 = Needs improvement; 3 = Meets expectation; 4 = Exceeds expectation; and 5 = Above and beyond.

  • Create a new variable bi_attrition by transforming the attrition variable into a numeric variable. Re-code accordingly as No = 0, and Yes = 1.

  • Save all the changes in the hr_perf_dta. Note that saving the changes with the same name will update the dataset with the new variables created.
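As an alternative sketch to `case_when`, base R's `factor()` can attach the same labels while also storing their order, which keeps the categories in the intended sequence when they are plotted later. The toy vector below is made up. This is not the approach the task asks for, only an illustration of the same recoding idea.

```r
# Toy ratings (made-up), including one out-of-range value
job_satisfaction_toy <- c(1, 3, 5, 2, 9)

sat_labels <- c("Very dissatisfied", "Dissatisfied", "Neutral",
                "Satisfied", "Very satisfied")

# Values outside 1-5 become NA, like the TRUE ~ NA_character_ branch
cat_job_sat_toy <- factor(job_satisfaction_toy,
                          levels  = 1:5,
                          labels  = sat_labels,
                          ordered = TRUE)

cat_job_sat_toy
levels(cat_job_sat_toy)
```

With a character column created by `case_when`, the order has to be restored with `factor(..., levels = ...)` before plotting; an ordered factor carries it from the start.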

hr_perf_dta <- hr_perf_dta |> 
  
## create cat_education
  mutate(
    cat_education = case_when(
      education == 1 ~ "No formal education",
      education == 2 ~ "High school",
      education == 3 ~ "Bachelor",
      education == 4 ~ "Masters",
      education == 5 ~ "Doctorate",
      TRUE ~ NA_character_ #ensures that any unrecognized values in the `education` column are assigned as `NA` 
    ),

## create cat_envi_sat,  cat_job_sat, and cat_relation_sat
    cat_envi_sat = case_when(
      environment_satisfaction == 1 ~ "Very dissatisfied",
      environment_satisfaction == 2 ~ "Dissatisfied",
      environment_satisfaction == 3 ~ "Neutral",
      environment_satisfaction == 4 ~ "Satisfied",
      environment_satisfaction == 5 ~ "Very satisfied",
      TRUE ~ NA_character_  
    ),
    cat_job_sat = case_when(
      job_satisfaction == 1 ~ "Very dissatisfied",
      job_satisfaction == 2 ~ "Dissatisfied",
      job_satisfaction == 3 ~ "Neutral",
      job_satisfaction == 4 ~ "Satisfied",
      job_satisfaction == 5 ~ "Very satisfied",
      TRUE ~ NA_character_  
    ),
    cat_relation_sat = case_when(
      relationship_satisfaction == 1 ~ "Very dissatisfied",
      relationship_satisfaction == 2 ~ "Dissatisfied",
      relationship_satisfaction == 3 ~ "Neutral",
      relationship_satisfaction == 4 ~ "Satisfied",
      relationship_satisfaction == 5 ~ "Very satisfied",
      TRUE ~ NA_character_  
    ),


## create cat_work_life_balance, cat_self_rating, and cat_manager_rating
    cat_work_life_balance = case_when(
      work_life_balance == 1 ~ "Unacceptable",
      work_life_balance == 2 ~ "Needs improvement",
      work_life_balance == 3 ~ "Meets expectation",
      work_life_balance == 4 ~ "Exceeds expectation",
      work_life_balance == 5 ~ "Above and beyond",
      TRUE ~ NA_character_  
    ),
    cat_self_rating = case_when(
      self_rating == 1 ~ "Unacceptable",
      self_rating == 2 ~ "Needs improvement",
      self_rating == 3 ~ "Meets expectation",
      self_rating == 4 ~ "Exceeds expectation",
      self_rating == 5 ~ "Above and beyond",
      TRUE ~ NA_character_  
    ),
     cat_manager_rating = case_when(
      manager_rating == 1 ~ "Unacceptable",
      manager_rating == 2 ~ "Needs improvement",
      manager_rating == 3 ~ "Meets expectation",
      manager_rating == 4 ~ "Exceeds expectation",
      manager_rating == 5 ~ "Above and beyond",
      TRUE ~ NA_character_  
    ),

## create bi_attrition
     bi_attrition = case_when(
      attrition == "No" ~ 0,
      attrition == "Yes" ~ 1,
      TRUE ~ NA_real_  # indicating a missing value (NA) that is of numeric type
    )
)


## print the updated hr_perf_dta using datatable function
datatable(hr_perf_dta)

5 Exploratory data analysis

5.1 Descriptive statistics of employee attrition

Task 5.1. Breakdown of attrition by key variables
  • Select the variables attrition, job_role, department, age, salary, job_satisfaction, and work_life_balance. Save as attrition_key_var_dta.

  • Compute and plot the attrition rate across job_role, department, age, salary, job_satisfaction, and work_life_balance. To compute the attrition rate by job role, for example, group the dataset by job_role. Afterward, use the count function to get the frequency of attrition for each job role and divide it by the total number of observations in that job role. Save the computation as pct_attrition. Do not forget to ungroup before storing the output. Store the output as attrition_rate_job_role.

  • Plot for the attrition rate across job_role has been done for you! Study each line of code. You have the freedom to customize your plot accordingly. Show your creativity!
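The grouping steps described above repeat for every variable, so they could be wrapped in a small function. This is a sketch only: the name `attrition_rate_by` is hypothetical and the toy data is made up. It computes the share of "Yes" within each group, which gives the same rate as the count-then-divide approach.

```r
library(dplyr)

# Hypothetical helper: attrition rate within each level of `group_var`
attrition_rate_by <- function(data, group_var) {
  data |>
    group_by({{ group_var }}) |>
    summarise(
      n_total       = n(),
      n_left        = sum(attrition == "Yes"),
      pct_attrition = n_left / n_total,
      .groups = "drop"
    )
}

# Toy data for illustration only
toy <- tibble(
  job_role  = c("Recruiter", "Recruiter", "Manager", "Manager", "Manager", "Manager"),
  attrition = c("Yes", "No", "No", "No", "Yes", "No")
)

attrition_rate_by(toy, job_role)
```

One design difference from the count-and-filter approach: a group with no leavers is kept with a rate of 0 instead of being dropped from the result.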

## selecting attrition key variables and save as `attrition_key_var_dta`
attrition_key_var_dta <- hr_perf_dta |> 
  select(attrition, job_role, department, age, salary, job_satisfaction, cat_job_sat, work_life_balance,  cat_work_life_balance)


## compute the attrition rate across job_role and save as attrition_rate_job_role
attrition_rate_job_role <- attrition_key_var_dta |> 
  group_by(job_role) |> 
  count(attrition) |> 
  mutate(pct_attrition = n / sum(n)) |> 
  ungroup()

##Christine's added comment: Filter only attrition cases (attrition == "Yes")
attrition_rate_job_role <- attrition_rate_job_role |> 
  filter(attrition == "Yes")


## print attrition_rate_job_role
attrition_rate_job_role
# A tibble: 11 × 4
   job_role                  attrition     n pct_attrition
   <chr>                     <chr>     <int>         <dbl>
 1 Analytics Manager         Yes          28        0.131 
 2 Data Scientist            Yes         597        0.430 
 3 Engineering Manager       Yes          18        0.0586
 4 HR Executive              Yes          29        0.244 
 5 Machine Learning Engineer Yes          95        0.163 
 6 Manager                   Yes          19        0.131 
 7 Recruiter                 Yes          86        0.566 
 8 Sales Executive           Yes         543        0.347 
 9 Sales Representative      Yes         317        0.634 
10 Senior Software Engineer  Yes          84        0.164 
11 Software Engineer         Yes         445        0.324 
## Plot the attrition rate
p <- ggplot(attrition_rate_job_role, aes(x = reorder(job_role, pct_attrition), y = pct_attrition)) + #By default, R orders categorical variables alphabetically which does not convey any meaningful information about attrition rates. Reorder function was utilized to change the order of the job_role categories based on their associated pct_attrition values, from lowest to highest attrition. 
  geom_col(fill = "#70945f", color = "black", width = 0.7, size = 0.15) +
  labs(title = "Attrition Rate by Job Role", 
       x = "Job Role", 
       y = "Attrition Rate (%)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), # for readability in the variable names placed in the x-axis
        panel.grid = element_line(color = "grey", size = 0.5),
        plot.title = element_text(face = "bold", size = 16.2, hjust = 0.5)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 1)) #accuracy within percent_format is used to round to the nearest whole number

#custom tooltip with percentage of attrition
p <- p + aes(text = paste("Attrition Rate:", 
                          "<b>", scales::percent(pct_attrition, accuracy = 1), "</b>"))
ggplotly(p, tooltip = "text")
## Christine's added code chunk 

## compute the attrition rate across department
attrition_rate_department <- attrition_key_var_dta |> 
  group_by(department) |> 
  count(attrition) |> 
  mutate(pct_attrition = n / sum(n)) |> 
  ungroup()

##Filter only attrition cases (attrition == "Yes")
attrition_rate_department <- attrition_rate_department |> 
  filter(attrition == "Yes")

## Plot the attrition rate
q <- ggplot(attrition_rate_department, aes(x = reorder(department, pct_attrition), y = pct_attrition)) +  
  geom_col(fill = "#2c494a", color = "black", width = 0.7, size = 0.15) +
  labs(title = "Attrition Rate by Department", 
       x = "Department", 
       y = "Attrition Rate (%)") +
  theme_minimal() +
  theme(panel.grid = element_line(color = "grey", size = 0.5),
        plot.title = element_text(face = "bold", size = 16.2, hjust = 0.5)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 1))

#custom tooltip with percentage of attrition
q <- q + aes(text = paste("Attrition Rate:", 
                          "<b>", scales::percent(pct_attrition, accuracy = 1), "</b>"))
ggplotly(q, tooltip = "text")
## Christine's added code chunk 

##Check first the min and max value
summary(attrition_key_var_dta$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   18.0    25.0    28.0    30.6    36.0    51.0 
##Check the data type of age variable
str(attrition_key_var_dta) 
tibble [6,899 × 9] (S3: tbl_df/tbl/data.frame)
 $ attrition            : chr [1:6899] "No" "No" "No" "No" ...
 $ job_role             : chr [1:6899] "Sales Executive" "Sales Executive" "Sales Executive" "Sales Executive" ...
 $ department           : chr [1:6899] "Sales" "Sales" "Sales" "Sales" ...
 $ age                  : num [1:6899] 30 30 30 30 30 30 30 30 30 38 ...
 $ salary               : num [1:6899] 102059 102059 102059 102059 102059 ...
 $ job_satisfaction     : num [1:6899] 3 4 5 3 4 2 5 2 5 3 ...
 $ cat_job_sat          : chr [1:6899] "Neutral" "Satisfied" "Very satisfied" "Neutral" ...
 $ work_life_balance    : num [1:6899] 4 2 4 3 3 3 4 2 5 5 ...
 $ cat_work_life_balance: chr [1:6899] "Exceeds expectation" "Needs improvement" "Exceeds expectation" "Meets expectation" ...
# drop rows with a missing age before converting to avoid data loss (which I personally experienced)
attrition_key_var_dta <- attrition_key_var_dta |> 
  filter(!is.na(age))

## age is already numeric (see the str() output above); the conversion is kept only as a safeguard in case age is read as character or factor
attrition_key_var_dta <- attrition_key_var_dta |> 
  mutate(age = as.numeric(as.character(age))) 

# Create age groups
attrition_key_var_dta <- attrition_key_var_dta |> 
  mutate(age_group = cut(age, 
                         breaks = c(18, 23, 28, 33, 38, 43, 48, 53),
                         labels = c("18-22", "23-27", "28-32", "33-37", "38-42", "43-47", "48+"), 
                         right = FALSE)) # intervals include the lower break and exclude the upper one, e.g. [18, 23)



## compute the attrition rate across age
attrition_rate_age <- attrition_key_var_dta |> 
  group_by(age_group) |> 
  count(attrition) |> 
  mutate(pct_attrition = n / sum(n)) |> 
  ungroup()

##Filter only attrition cases (attrition == "Yes")
attrition_rate_age <- attrition_rate_age |> 
  filter(attrition == "Yes")

## Plot the attrition rate
r <- ggplot(attrition_rate_age, aes(x = age_group, y = pct_attrition)) +  
  geom_col(fill = "#ff84c3", color = "black", width = 0.7, size = 0.15) +
  labs(title = "Attrition Rate by Age", 
       x = "Age", 
       y = "Attrition Rate (%)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid = element_line(color = "grey", size = 0.5),
        plot.title = element_text(face = "bold", size = 16.2, hjust = 0.5)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 1))

#custom tooltip with percentage of attrition
r <- r + aes(text = paste("Attrition Rate:", 
                          "<b>", scales::percent(pct_attrition, accuracy = 1), "</b>"))
ggplotly(r, tooltip = "text")
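With `right = FALSE`, `cut()` builds intervals that include the lower break and exclude the upper one, so an age of exactly 23 falls in the second group, not the first. A small sketch with made-up ages, using the default interval labels so the boundaries are visible:

```r
# Made-up ages chosen to sit on and around the break points
ages_toy <- c(18, 22, 23, 27, 28, 51)

age_group_toy <- cut(ages_toy,
                     breaks = c(18, 23, 28, 33, 38, 43, 48, 53),
                     right  = FALSE)  # intervals are [lower, upper)

data.frame(age = ages_toy, age_group = age_group_toy)
```

Checking a few boundary values like this helps confirm that custom labels describe the intervals `cut()` actually creates.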
## Christine's added code chunk 

##Check first the min and max value
summary(attrition_key_var_dta$salary)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  20387   44646   74458  110898  137220  547204 
# Set minimum and maximum salary values
min_salary <- 20387
max_salary <- 547204

# Define bin width (choose based on your analysis goals)
bin_width <- 50000  

# Calculate breaks based on the bin width
breaks <- seq(min_salary, max_salary + bin_width, by = bin_width)

# Create salary groups
attrition_key_var_dta <- attrition_key_var_dta |> 
  mutate(salary_group = cut(salary, 
                             breaks = breaks, 
                             labels = paste(head(breaks, -1), tail(breaks, -1), sep = "-"), 
                             right = FALSE))


## compute the attrition rate across salary
attrition_rate_salary <- attrition_key_var_dta |> 
  group_by(salary_group) |> 
  count(attrition) |> 
  mutate(pct_attrition = n / sum(n)) |> 
  ungroup()

##Filter only attrition cases (attrition == "Yes")
attrition_rate_salary <- attrition_rate_salary |> 
  filter(attrition == "Yes")

## Plot the attrition rate
s <- ggplot(attrition_rate_salary, aes(x = salary_group, y = pct_attrition)) +  
  geom_col(fill = "#9a6c57", color = "black", width = 0.7, size = 0.15) +
  labs(title = "Attrition Rate by Salary", 
       x = "Salary", 
       y = "Attrition Rate (%)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid = element_line(color = "grey", size = 0.5),
        plot.title = element_text(face = "bold", size = 16.2, hjust = 0.5)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 1))

#custom tooltip with percentage of attrition
s <- s + aes(text = paste("Attrition Rate:", 
                          "<b>", scales::percent(pct_attrition, accuracy = 1), "</b>"))
ggplotly(s, tooltip = "text")
## Christine's added code chunk 

## compute the attrition rate across job satisfaction using cat_job_sat
attrition_rate_job_satisfaction <- attrition_key_var_dta |> 
  group_by(cat_job_sat) |> 
  count(attrition) |> 
  mutate(pct_attrition = n / sum(n)) |> 
  ungroup()

##Filter only attrition cases (attrition == "Yes")
attrition_rate_job_satisfaction <- attrition_rate_job_satisfaction |> 
  filter(attrition == "Yes")

#To reflect the right sequence in the graph
attrition_rate_job_satisfaction <- attrition_rate_job_satisfaction |> 
  mutate(cat_job_sat = factor(cat_job_sat, 
                              levels = c("Very dissatisfied", 
                                         "Dissatisfied", 
                                         "Neutral", 
                                         "Satisfied", 
                                         "Very satisfied")))

## Plot the attrition rate
t <- ggplot(attrition_rate_job_satisfaction, aes(x = cat_job_sat, y = pct_attrition)) +  
  geom_col(fill = "#a72219", color = "black", width = 0.7, size = 0.15) +
  labs(title = "Attrition Rate by Job Satisfaction", 
       x = "Job Satisfaction", 
       y = "Attrition Rate (%)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid = element_line(color = "grey", size = 0.5),
        plot.title = element_text(face = "bold", size = 16.2, hjust = 0.5)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 1))

#custom tooltip with percentage of attrition
t <- t + aes(text = paste("Attrition Rate:", 
                          "<b>", scales::percent(pct_attrition, accuracy = 1), "</b>"))
ggplotly(t, tooltip = "text")
## Christine's added code chunk 

## compute the attrition rate across work-life balance using cat_work_life_balance 
attrition_rate_work_life_balance <- attrition_key_var_dta |> 
  group_by(cat_work_life_balance) |> 
  count(attrition) |> 
  mutate(pct_attrition = n / sum(n)) |> 
  ungroup()

##Filter only attrition cases (attrition == "Yes")
attrition_rate_work_life_balance <- attrition_rate_work_life_balance |> 
  filter(attrition == "Yes")

#To reflect the right sequence in the graph
attrition_rate_work_life_balance <- attrition_rate_work_life_balance |> 
  mutate(cat_work_life_balance = factor(cat_work_life_balance, 
                              levels = c("Unacceptable", 
                                         "Needs improvement", 
                                         "Meets expectation", 
                                         "Exceeds expectation", 
                                         "Above and beyond")))

## Plot the attrition rate
u <- ggplot(attrition_rate_work_life_balance, aes(x = cat_work_life_balance, y = pct_attrition)) +  
  geom_col(fill = "#ffe366", color = "black", width = 0.7, size = 0.15) +
  labs(title = "Attrition Rate by Work-Life Balance", 
       x = "Work-Life Balance", 
       y = "Attrition Rate (%)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid = element_line(color = "grey", size = 0.5),
        plot.title = element_text(face = "bold", size = 16.2, hjust = 0.5)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 1))

#custom tooltip with percentage of attrition
u <- u + aes(text = paste("Attrition Rate:", 
                          "<b>", scales::percent(pct_attrition, accuracy = 1), "</b>"))
ggplotly(u, tooltip = "text")

5.2 Identifying attrition key drivers using correlation analysis

Task 5.2. Conduct a correlation analysis to identify key drivers
  • Conduct a correlation analysis of key variables: bi_attrition, salary, years_at_company, job_satisfaction, manager_rating, and work_life_balance. Use the cor() function to run the correlation analysis. Remove missing values using the na.omit() before running the correlation analysis. Save the output in hr_corr.

  • Use a correlation matrix or heatmap to visualize the relationship between these variables and attrition. You can use the GGally package and use the ggcorr function to visualize the correlation heatmap. You may explore this site for more information: ggcorr.

  • Discuss which factors seem most correlated with attrition and what that suggests about why employees are leaving.
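One point worth keeping in mind for the steps above: calling `na.omit()` on a whole data frame drops any row with a missing value in any column, including columns that are not part of the correlation. A sketch of selecting the variables first, on made-up data (the `notes` column is a hypothetical extra column):

```r
# Toy data: made-up values; `notes` is an unrelated column with missing values
toy <- data.frame(
  bi_attrition     = c(1, 0, 0, 1, 0, 1),
  years_at_company = c(1, 8, 6, 2, NA, 0),
  salary           = c(30, 90, 70, 40, 80, 35),
  notes            = c(NA, "a", NA, "b", "c", NA)
)

key_vars <- c("bi_attrition", "years_at_company", "salary")

nrow(na.omit(toy))              # rows left when every column must be complete
nrow(na.omit(toy[, key_vars]))  # rows left when only the key variables count

# Correlation matrix on the rows that are complete for the key variables
toy_corr <- cor(na.omit(toy[, key_vars]))
round(toy_corr, 2)
```

Because bi_attrition is coded 0/1, its Pearson correlation with a numeric variable is the point-biserial correlation, so the sign shows whether leavers tend to have lower or higher values.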

## Christine's added code chunk 

## Remove missing values
hr_perf_dta_no_NA <- na.omit(hr_perf_dta)

## conduct correlation of key variables. 
hr_corr <- cor(hr_perf_dta_no_NA[, c("bi_attrition", "salary", "years_at_company", "job_satisfaction", "manager_rating", "work_life_balance")])

## print hr_corr 
hr_corr
                  bi_attrition       salary years_at_company job_satisfaction
bi_attrition       1.000000000 -0.211181478    -0.6896527798     0.0132368129
salary            -0.211181478  1.000000000     0.2206442116     0.0053054850
years_at_company  -0.689652780  0.220644212     1.0000000000     0.0008700583
job_satisfaction   0.013236813  0.005305485     0.0008700583     1.0000000000
manager_rating    -0.007654429 -0.001596736     0.0178656879    -0.0158205481
work_life_balance  0.003428836 -0.001517145     0.0079339508     0.0417242942
                  manager_rating work_life_balance
bi_attrition        -0.007654429       0.003428836
salary              -0.001596736      -0.001517145
years_at_company     0.017865688       0.007933951
job_satisfaction    -0.015820548       0.041724294
manager_rating       1.000000000       0.007996938
work_life_balance    0.007996938       1.000000000
# Create a correlation heatmap
ggcorr(hr_perf_dta_no_NA[, c("bi_attrition", "salary", "years_at_company", "job_satisfaction", "manager_rating", "work_life_balance")],
       label = TRUE,
       label_size = 3,
       label_color = "black",
       hjust = 0.75,
       low = "#ff3f31", high = "#149127", mid = "#ffeb38", # colors for negative (low), positive (high), and zero (mid) correlations
       digits = 2) +
   labs(title = "Correlation Heatmap of Selected Variables", 
       subtitle = "Analyzing Relationships Between Key Factors and Attrition") +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),  # Centered title
        plot.subtitle = element_text(hjust = 0.5, size = 12))

## GGally must be installed before ggcorr is used above; run this once in the console rather than on every render
# install.packages("GGally")
Discussion:

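The variable most strongly correlated with attrition is ‘years_at_company’, with a correlation coefficient of about -0.69. The negative relationship suggests that the longer employees stay with the company, the less likely they are to leave, possibly because they have developed stronger loyalty and commitment to the company and enjoy greater job security. Salary shows a weaker negative correlation (about -0.21), while job_satisfaction, manager_rating, and work_life_balance are essentially uncorrelated with attrition (all below 0.02 in absolute value).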

5.3 Predictive modeling for attrition

Task 5.3. Predictive modeling for attrition
  • Create a logistic regression model to predict employee attrition using the following variables: salary, years_at_company, job_satisfaction, manager_rating, and work_life_balance. Save the model as hr_attrition_glm_model. Print the summary of the model using the summary function.

  • Install the sjPlot package and use the tab_model function to display the summary of the model. You may read the documentation here on how to customize your model summary.

  • Also, use the plot_model function to visualize the model coefficients. You may read the documentation here on how to customize your model visualization.

  • Discuss the results of the logistic regression model and what they suggest about the factors that contribute to employee attrition.

## run a logistic regression model to predict employee attrition
## save the model as hr_attrition_glm_model
hr_attrition_glm_model <- glm(bi_attrition ~ salary + years_at_company + job_satisfaction + manager_rating + work_life_balance, 
                              data = hr_perf_dta_no_NA, 
                              family = binomial) #specifies the model to be logistic regression with binary dependent variable

## print the summary of the model using the summary function
summary(hr_attrition_glm_model)

Call:
glm(formula = bi_attrition ~ salary + years_at_company + job_satisfaction + 
    manager_rating + work_life_balance, family = binomial, data = hr_perf_dta_no_NA)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)        2.571e+00  2.173e-01  11.831   <2e-16 ***
salary            -3.633e-06  4.086e-07  -8.893   <2e-16 ***
years_at_company  -6.333e-01  1.476e-02 -42.919   <2e-16 ***
job_satisfaction   3.470e-02  3.186e-02   1.089    0.276    
manager_rating     5.071e-03  3.810e-02   0.133    0.894    
work_life_balance  2.587e-02  3.198e-02   0.809    0.419    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8574.5  on 6708  degrees of freedom
Residual deviance: 4781.6  on 6703  degrees of freedom
AIC: 4793.6

Number of Fisher Scoring iterations: 5
## install the sjPlot package (run once in the console, not on every render) and use the tab_model function to display the summary of the model
# install.packages("sjPlot")

# Display the logistic regression model summary in a tabular format
tab_model(hr_attrition_glm_model, 
          title = "<strong style='text-align:center;'>Logistic Regression Model Predicting Employee Attrition</strong>", 
          show.ci = TRUE, 
          show.se = TRUE, 
          show.stat = TRUE)
Logistic Regression Model Predicting Employee Attrition
  bi attrition
Predictors          Odds Ratios   std. Error   CI            Statistic   p
(Intercept)         13.08         2.84         0.00 – Inf     11.83      <0.001
salary               1.00         0.00         0.00 – Inf     -8.89      <0.001
years at company     0.53         0.01         0.00 – Inf    -42.92      <0.001
job satisfaction     1.04         0.03         0.00 – Inf      1.09      0.276
manager rating       1.01         0.04         0.00 – Inf      0.13      0.894
work life balance    1.03         0.03         0.00 – Inf      0.81      0.419
Observations        6709
R2 Tjur             0.502
## use plot_model function to visualize the model coefficients

z <- plot_model(hr_attrition_glm_model, 
           type = "est", 
           show.values = TRUE, 
           value.offset = 0.3, 
           title = "Model Coefficients - Employee Attrition",
           ci.lvl = 0.95,
           colors = "Set1", 
           axis.title = c("Variables", "Coefficient Estimate"), 
           value.size = 4, 
           axis.labels = c("Work-Life Balance", "Manager Rating", "Job Satisfaction", "Years at Company", "Salary"), 
           grid = TRUE,  
           theme = theme_bw())

## plot_model uses some default plot settings and it might interfere with theme() customization if integrated with the above code
z <- z + theme_bw() +
          theme(plot.title = element_text(hjust = 0.5, face = "bold"))

## Add legend to distinguish colors
legend <- grid::textGrob("Red = Negative Coefficients\nBlue = Positive Coefficients", 
                          gp = grid::gpar(col = "black", fontsize = 9.5, fontface = "italic"), 
                          just = "left", 
                          hjust = 0, 
                          x = unit(0.05, "npc"))  # Force padding for left alignment

# Combine the plot and the legend in a layout
grid.arrange(z, legend, nrow = 2, heights = c(5, 0.5))

## Christine's concern: when this code chunk is run by itself, the visual does not show up in the viewer (or does it just take longer to load?), but it is there when the document is rendered.
Discussion:

Based on the p-values, only ‘salary’ and ‘years_at_company’ are statistically significant at the 1% level of significance, together with the intercept. Both variables also have negative coefficients, which indicates that a higher salary and a longer tenure are associated with a lower likelihood of employees leaving. The remaining predictors, job_satisfaction, manager_rating, and work_life_balance, are not statistically significant (p = 0.276, 0.894, and 0.419, respectively).
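Logistic regression coefficients are on the log-odds scale, so exponentiating them gives the odds ratios that `tab_model` reports. The sketch below reuses the estimates printed in the model summary above; the 10,000 salary step is an arbitrary choice made here for illustration.

```r
# Estimates copied from the summary() output of hr_attrition_glm_model
coefs <- c(
  salary            = -3.633e-06,
  years_at_company  = -6.333e-01,
  job_satisfaction  =  3.470e-02,
  manager_rating    =  5.071e-03,
  work_life_balance =  2.587e-02
)

# Odds ratio for a one-unit increase in each predictor
odds_ratios <- exp(coefs)
round(odds_ratios, 2)

# Salary is measured in single currency units, so its per-unit odds ratio
# rounds to 1.00; rescale to a 10,000 increase to see the effect
or_salary_10k <- exp(coefs["salary"] * 10000)
round(or_salary_10k, 3)
```

Read this way, each additional year at the company multiplies the odds of leaving by about 0.53, and a 10,000 increase in salary multiplies them by about 0.96, holding the other variables constant.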

5.4 Analysis of compensation and turnover

Task 5.4. Analyzing compensation and turnover
  • Compare the average monthly income of employees who left the company (bi_attrition = 1) and those who stayed (bi_attrition = 0). Use the t.test function to conduct a t-test and determine if there is a significant difference in average monthly income between the two groups. Save the results in a variable called attrition_ttest_results.

  • Install the report package and use the report function to generate a report of the t-test results.

  • Install the ggstatsplot package and use the ggbetweenstats function to visualize the distribution of monthly income for employees who left and those who stayed. Make sure to map the bi_attrition variable to the x argument and the salary variable to the y argument.

  • Visualize the salary variable for employees who left and those who stayed using geom_histogram with geom_freqpoly. Make sure to facet the plot by the bi_attrition variable and apply alpha on the histogram plot.

  • Provide recommendations on whether revising compensation policies could be an effective retention strategy.

## compare the average monthly income of employees who left and those who stayed
attrition_ttest_results <- t.test(salary ~ bi_attrition, data = hr_perf_dta_no_NA)


## print the results of the t-test
print(attrition_ttest_results)

    Welch Two Sample t-test

data:  salary by bi_attrition
t = 19.074, df = 5557.5, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 39387.67 48411.52
sample estimates:
mean in group 0 mean in group 1 
      125856.35        81956.76 
## install the report package (run once in the console, not on every render) and use the report function to generate a report of the t-test results
# install.packages("report")

# Generate a report of the t-test results
attrition_report <- report(attrition_ttest_results)
print(attrition_report)
Effect sizes were labelled following Cohen's (1988) recommendations.

The Welch Two Sample t-test testing the difference of salary by bi_attrition
(mean in group 0 = 1.26e+05, mean in group 1 = 81956.76) suggests that the
effect is positive, statistically significant, and medium (difference =
43899.59, 95% CI [39387.67, 48411.52], t(5557.53) = 19.07, p < .001; Cohen's d
= 0.51, 95% CI [0.46, 0.57])
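# install ggstatsplot package and use ggbetweenstats function to visualize the distribution of monthly income for employees who left and those who stayed
install.packages("ggstatsplot")  # run once; can be skipped if already installed
library(ggstatsplot)             # load the package so ggbetweenstats() is available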

ggbetweenstats(data = hr_perf_dta_no_NA, 
               x = bi_attrition, 
               y = salary, 
               xlab = "Attrition (0 = Stayed, 1 = Left)",
               ylab = "Monthly Income", 
               title = "Monthly Income Distribution: Employees Who Left vs. Stayed")

# create histogram and frequency polygon of salary for employees who left and those who stayed

ggplot(hr_perf_dta_no_NA, aes(x = salary, fill = as.factor(bi_attrition))) + 
  geom_histogram(alpha = 0.5, position = "identity", bins = 30) + 
  # freqpoly is kept on the count scale so it overlays the histogram bars
  geom_freqpoly(bins = 30, color = "black") + 
  facet_wrap(~ bi_attrition, scales = "free") +
  labs(title = "Salary Distribution: Employees Who Left vs. Stayed",
       x = "Salary", y = "Count", fill = "Attrition Status") +
  scale_fill_manual(values = c("0" = "#f25a0f", "1" = "#485e5b")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 13))  

Discussion:

Determine the Significant Difference:

The t-test comparing the average monthly income of employees who left the company with that of employees who stayed gives a p-value below 2.2e-16, well under the common alpha level of 5%, so the difference in average salaries between the two groups is statistically significant. It can also be discerned that employees who stayed earn more on average (125,856.35) than those who left (81,956.76), a difference of 43,899.59 that the report function labels a medium effect (Cohen's d = 0.51).
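Note that the printed `p-value < 2.2e-16` is a display floor used by R rather than the p-value itself. The exact value, the group means, and the confidence interval are stored in the object returned by t.test. Below is a minimal sketch on simulated data. The data frame sim_salary and its means and standard deviations are hypothetical and chosen only for illustration.

```r
# Simulated salaries (hypothetical values, not the project's data)
set.seed(147)
sim_salary <- data.frame(
  bi_attrition = rep(c(0, 1), times = c(400, 200)),
  salary = c(rnorm(400, mean = 120000, sd = 40000),
             rnorm(200, mean = 80000, sd = 40000))
)

# Welch two-sample t-test, same call pattern as in the analysis above
sim_ttest <- t.test(salary ~ bi_attrition, data = sim_salary)

sim_ttest$p.value    # exact p-value stored in the object
sim_ttest$estimate   # group means (group 0 = stayed, group 1 = left)
sim_ttest$conf.int   # 95% CI for mean(group 0) - mean(group 1)

# Difference in means, stayed minus left
mean_diff <- unname(sim_ttest$estimate[1] - sim_ttest$estimate[2])
mean_diff
```

The same accessors can be applied to attrition_ttest_results to pull the numbers quoted in the discussion directly from the test object instead of retyping them.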

Recommendations on revising compensation policies as an effective retention strategy:

Given the substantial salary gap between employees who stayed and those who left, revising the compensation policies is likely to be an effective employee retention strategy. For instance, adjusting salaries, especially for employees at risk of leaving, may enhance retention and boost overall job satisfaction. Introducing performance-based incentives can also create a mutually beneficial arrangement that aligns employee goals with organizational success. Another strategy is to offer a comprehensive benefits package, including health insurance and retirement plans, which can significantly improve employee satisfaction and loyalty. Lastly, the company can invest in training programs and career development initiatives so that employees see opportunities for growth within the organization, reducing the likelihood of them leaving the company.

5.5 Employee satisfaction and performance analysis

Task 5.5. Analyzing employee satisfaction and performance
  • Analyze the average performance ratings (both ManagerRating and SelfRating) of employees who left vs. those who stayed. Use the group_by and count functions to calculate the average performance ratings for each group.

  • Visualize the distribution of SelfRating for employees who left and those who stayed using a bar plot. Use the ggplot function to create the plot and map the SelfRating variable to the x argument and the bi_attrition variable to the fill argument.

  • Similarly, visualize the distribution of ManagerRating for employees who left and those who stayed using a bar plot. Make sure to map the ManagerRating variable to the x argument and the bi_attrition variable to the fill argument.

  • Create a boxplot of salary by job_satisfaction and bi_attrition to analyze the relationship between salary, job satisfaction, and attrition. Use the geom_boxplot function to create the plot and map the salary variable to the x argument, the job_satisfaction variable to the y argument, and the bi_attrition variable to the fill argument. You need to transform the job_satisfaction and bi_attrition variables into factors before creating the plot or within the ggplot function.

  • Discuss the results of the analysis and provide recommendations for HR interventions based on the findings.

# Analyze the average performance ratings (both ManagerRating and SelfRating) of employees who left vs. those who stayed.
average_ratings <- hr_perf_dta_no_NA %>%
  group_by(bi_attrition) %>%
  summarise(
    Avg_SelfRating = mean(self_rating, na.rm = TRUE),
    Avg_ManagerRating = mean(manager_rating, na.rm = TRUE),
    Count = n()  # Count of employees in each group
  )

print(average_ratings)
# A tibble: 2 × 4
  bi_attrition Avg_SelfRating Avg_ManagerRating Count
         <dbl>          <dbl>             <dbl> <int>
1            0           3.98              3.48  4448
2            1           3.99              3.46  2261
# Count occurrences of each category in the original data
count_data <- hr_perf_dta_no_NA |>
  group_by(cat_self_rating, bi_attrition) |>
  summarise(count = n(), .groups = 'drop')

# All possible categories (with zero counts)
all_categories <- data.frame(cat_self_rating = c(
  "Unacceptable", "Needs improvement",
  "Meets expectation", "Exceeds expectation",
  "Above and beyond"))

# Join with all categories to include those with zero counts
final_data <- all_categories |>
  left_join(count_data, by = "cat_self_rating") |>
  mutate(count = replace_na(count, 0),
         bi_attrition = factor(bi_attrition, levels = c(0, 1), labels = c("Stayed", "Left")))
# Displaying categories with zero counts visually highlights gaps in the data
# collection or response patterns, which could indicate that those options were
# either not applicable or deemed irrelevant by respondents.

# Visualize the distribution of SelfRating for employees who left and those who stayed using a bar plot.
f <- ggplot(final_data, aes(x = cat_self_rating, y = count, fill = as.factor(bi_attrition),
                            text = paste("Count:", "<b>", count, "</b>"))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Distribution of Self Rating: Employees Who Left vs. Stayed",
       x = "Self Rating", y = "Count", fill = "Attrition Status") +
  scale_fill_manual(values = c("Stayed" = "#f25a0f", "Left" = "#485e5b")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 13, face = "bold"),
        axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(f, tooltip = "text")
# Count occurrences of each category in the original data
count_data_2 <- hr_perf_dta_no_NA |>
  group_by(cat_manager_rating, bi_attrition) |>
  summarise(count = n(), .groups = 'drop')

# All possible categories (with zero counts)
all_categories_2 <- data.frame(cat_manager_rating = c(
  "Unacceptable", "Needs improvement",
  "Meets expectation", "Exceeds expectation",
  "Above and beyond"))

# Join with all categories to include those with zero counts
final_data_2 <- all_categories_2 |>
  left_join(count_data_2, by = "cat_manager_rating") |>
  mutate(count = replace_na(count, 0),
         bi_attrition = factor(bi_attrition, levels = c(0, 1), labels = c("Stayed", "Left")))

# Visualize the distribution of ManagerRating for employees who left and those who stayed using a bar plot.
j <- ggplot(final_data_2, aes(x = cat_manager_rating, y = count, fill = as.factor(bi_attrition),
                              text = paste("Count:", "<b>", count, "</b>"))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Distribution of Manager Rating: Employees Who Left vs. Stayed",
       x = "Manager Rating", y = "Count", fill = "Attrition Status") +
  scale_fill_manual(values = c("Stayed" = "#f25a0f", "Left" = "#485e5b")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 13, face = "bold"),
        axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(j, tooltip = "text")
# Convert variables to factors
hr_perf_dta_no_NA$cat_job_sat <- factor(hr_perf_dta_no_NA$cat_job_sat, levels = c("Very dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very satisfied"))
hr_perf_dta_no_NA$bi_attrition <- as.factor(hr_perf_dta_no_NA$bi_attrition)

# create a boxplot of salary by job_satisfaction and bi_attrition to analyze the relationship between salary, job satisfaction, and attrition.
ggplot(hr_perf_dta_no_NA, aes(x = salary, y = cat_job_sat, fill = bi_attrition)) +
  geom_boxplot() +
  labs(title = "Salary by Job Satisfaction and Attrition",
       x = "Salary", y = "Job Satisfaction", fill = "Attrition Status") +
  scale_fill_manual(values = c("0" = "#f25a0f", "1" = "#485e5b"), 
                    labels = c("0" = "Stayed", "1" = "Left")) +
  scale_x_continuous(labels = comma) + #to convert scientific notation to standard numbers
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 13, face = "bold"))

Discussion:

BASED ON AVERAGE PERFORMANCE RATINGS FOR EACH GROUP (Summary):

The average self-ratings of employees who stayed (3.98) and those who left (3.99) are nearly identical, suggesting that self-ratings alone may not be a good predictor of attrition. Average manager ratings are only slightly higher for employees who stayed (3.48) than for those who left (3.46), so any link between manager perception and attrition appears weak at best. The larger number of employees who stayed (4,448) compared with those who left (2,261) might suggest that the company has a low attrition rate. However, it is important to consider that 2,261 departures, roughly a third of the observations, is still a significant number, and it could be bad for the company’s image.
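As a quick check on how large the group that left is relative to the whole sample, the group shares can be computed from the Count column printed in the table above. This is plain arithmetic on the reported counts, not a new analysis.

```r
# Group counts copied from the average_ratings table above
group_counts <- c(Stayed = 4448, Left = 2261)

# Share of each group among all observations
group_share <- group_counts / sum(group_counts)
round(group_share, 3)
# Stayed 0.663, Left 0.337
```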

BOXPLOT VISUALIZATION:

There is a clear trend of increasing salary with higher job satisfaction, as shown by the rightward shift of the salary distribution from “Very dissatisfied” to “Very satisfied.” Employees who left tend to have lower salaries than those who stayed, regardless of satisfaction, and their salaries are clustered within a narrower range. In contrast, salaries of those who stayed are more varied. Overall, the boxplot suggests a correlation between higher job satisfaction and higher salaries.
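A numeric summary can back up what is read off the boxplot. The sketch below computes the median and interquartile range of salary for each combination of job satisfaction and attrition status. It runs on simulated data (sim_box is hypothetical and carries no real pattern), so it only demonstrates the mechanics.

```r
# Simulated stand-in data (hypothetical, no real relationship built in)
set.seed(147)
sat_levels <- c("Very dissatisfied", "Dissatisfied", "Neutral",
                "Satisfied", "Very satisfied")
sim_box <- data.frame(
  cat_job_sat = factor(sample(sat_levels, 500, replace = TRUE),
                       levels = sat_levels),
  bi_attrition = factor(sample(c(0, 1), 500, replace = TRUE)),
  salary = round(runif(500, 20000, 200000))
)

# Median and interquartile range of salary for each box in the plot
box_summary <- aggregate(
  salary ~ cat_job_sat + bi_attrition,
  data = sim_box,
  FUN = function(x) c(median = median(x), iqr = IQR(x))
)
print(box_summary)
```

Applied to hr_perf_dta_no_NA, the median column would show whether the rightward shift is consistent across satisfaction levels, and the IQR column would quantify how much narrower the salary range is for employees who left.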

RECOMMENDATION:

The visuals suggest that salary may be a significant factor in employee retention. Before adjusting the salary structure, it is important to make sure that pay levels remain competitive within the industry and for the role, taking into account the company’s financial resources and its ability to afford the proposed salary adjustments. The company could implement a system that fairly compensates employees based on their contribution and performance, so that employees feel more appreciated and valued and are ultimately less likely to leave the company. The company can also review other factors that affect employee retention, such as training opportunities, flexible work arrangements, work-life balance, and company culture.

5.6 Work-life balance and retention strategies

Task 5.6. Analyzing work-life balance and retention strategies

At this point, you are already well aware of the dataset and the possible factors that contribute to employee attrition. Using your R skills, accomplish the following tasks:

  • Analyze the distribution of WorkLifeBalance ratings for employees who left versus those who stayed.

  • Use visualizations to show the differences.

  • Assess whether employees with poor work-life balance are more likely to leave.

You have the freedom how you will accomplish this task. Be creative and provide insights that will help HR develop effective retention strategies.

ggplot(hr_perf_dta_no_NA, aes(x = cat_work_life_balance, fill = as.factor(bi_attrition))) +
  geom_bar(position = "dodge", color = "black", linewidth = 0.05) +  # linewidth replaces size for outlines in ggplot2 >= 3.4.0
  labs(title = "Work-Life Balance Ratings by Employee Status",
       x = "Work-Life Balance Rating",
       fill = "Employee Status") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 13, face = "bold"),
         axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_manual(values = c("0" = "#f25a0f", "1" = "#485e5b"), 
                    breaks = c("0", "1"), 
                    labels = c("Stayed", "Left")) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1)))

Discussion:

Satisfied employees, those with higher work-life balance ratings, are more likely to remain in the business. For the organization's HR team, this means strengthening their existing policies that promote a healthy work-life balance, while also aiming to improve it by providing tailored support, such as time-off incentives, for employees whose rating is “needs improvement”. HR could also conduct interviews or surveys to better understand employees' concerns about their work-life balance.
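Because the bar chart shows raw counts, and employees who stayed outnumber those who left in the data, comparing the share who left within each rating is a more direct way to assess whether poor work-life balance goes with leaving. The following is a minimal sketch using a cross-tabulation and a chi-squared test of independence. The data are simulated, and the rating labels in wlb_levels are assumed for illustration.

```r
# Simulated stand-in data (hypothetical labels and attrition probability)
set.seed(147)
wlb_levels <- c("Unacceptable", "Needs improvement", "Meets expectation",
                "Exceeds expectation", "Above and beyond")
sim_wlb <- data.frame(
  cat_work_life_balance = factor(sample(wlb_levels, 1000, replace = TRUE),
                                 levels = wlb_levels),
  bi_attrition = rbinom(1000, size = 1, prob = 0.3)
)

# Cross-tabulate rating (rows) by attrition status (columns)
wlb_table <- table(sim_wlb$cat_work_life_balance, sim_wlb$bi_attrition)

# Row-wise proportions: column "1" is the attrition rate within each rating
wlb_rates <- prop.table(wlb_table, margin = 1)
print(round(wlb_rates, 3))

# Chi-squared test of independence between rating and attrition
wlb_chisq <- chisq.test(wlb_table)
print(wlb_chisq)
```

If the attrition rate in column "1" rises as the rating worsens and the test rejects independence, that would support the claim above more firmly than the count-based bars alone.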

5.7 Recommendations for HR interventions

Task 5.7. Recommendations for HR interventions

Based on the analysis conducted, provide recommendations for HR interventions that could help reduce employee attrition and improve overall employee satisfaction and performance. You may use the following question as guide for your recommendations and discussions.

Question: What are the key factors contributing to employee attrition in the company?

Answer:

By just looking at the graphs we can pinpoint potential contributors to attrition, but the graphs alone do not provide conclusive evidence. Based only on the graphs, the key factors are:

  • job role
  • department
  • age
  • salary
  • manager rating
  • work-life balance

The attrition rate is not uniform across the values of these variables: the bar heights vary along the x-axis, which suggests that these variables may be contributing factors.

In the correlation test, ‘years at company’ shows the strongest correlation with attrition among all the variables, but it does not reach the ±0.8 threshold for a strong correlation. Correlation also does not imply causation.

Question: Which factors are most strongly correlated with attrition?

Answer:

In the correlation test, the variable ‘years at company’ has the strongest correlation with attrition. If the concern is which variables have a significant effect on attrition, ‘salary’ and ‘years at company’ are both statistically significant, and both effects are negative: higher values are associated with lower attrition.
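Since attrition is coded 0/1, its Pearson correlation with a numeric variable is a point-biserial correlation, and cor.test reports both the coefficient and its p-value. The sketch below uses simulated vectors (hypothetical values, with a negative relationship built in) to show that a coefficient can be statistically significant without being large.

```r
# Simulated stand-in data (hypothetical): leaving becomes less likely with tenure
set.seed(147)
n <- 1000
years_at_company <- sample(0:10, n, replace = TRUE)
bi_attrition <- rbinom(n, size = 1, prob = plogis(0.5 - 0.3 * years_at_company))

# Pearson correlation with a 0/1 variable is the point-biserial correlation
tenure_cor <- cor.test(years_at_company, bi_attrition)
tenure_cor$estimate   # correlation coefficient r
tenure_cor$p.value    # test of the null hypothesis r = 0

# Check the coefficient against a +/-0.8 threshold for a strong correlation
abs(unname(tenure_cor$estimate)) >= 0.8
```

This separates the two questions raised in the answer: the p-value speaks to statistical significance, while the size of r speaks to strength.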

Question: What strategies could be implemented to improve employee retention and satisfaction?

Answer:

  • Salary-related strategies bear directly on the organization's attrition concern. Adopting industry-standard salaries ensures competitive compensation, which can reduce the likelihood of attrition and help attract talent. Introducing performance-based pay is another strategy that can enhance job satisfaction, as it incentivizes employees to excel in their roles while also contributing to the organization’s success. Another approach would be to provide a benefits package, which could include health insurance and bonuses, to improve employee satisfaction.

  • Work-life balance strategies can also be a significant complement to the approaches that minimize employee attrition. Beyond strengthening the existing policies that promote a healthy work-life balance, the organization can also provide tailored support, such as flexible work arrangements or encouragement to use vacation days and personal leave, so that employees feel taken care of by the organization and are less likely to leave.

Question: How can HR leverage the insights from the analysis to develop effective retention strategies?

Answer:

  • To ensure competitive compensation that adheres to industry-standard salaries, the HR team can gather and analyze market salary data systematically. To implement performance-based pay, they can set clear performance metrics and then roll out performance-based pay systems. And to tailor benefit packages to employees, the HR team can conduct a survey among employees to get a clear picture of their preferences and needs.

  • To further promote work-life balance, the HR department can expand flexible work arrangements by giving employees the option to work remotely or on a compressed workweek, while making sure that the success of the organization is not compromised. To ensure the effectiveness of this approach, a regular review of these policies through employee feedback is a must. Similarly, to encourage time off from work, which creates a culture that values rest and rejuvenation, the organization can send reminders about its vacation policies.

Question: What are the potential benefits of implementing these strategies for the company?

Answer:

Implementing these strategies can help reduce employee turnover and improve retention by making the company a desirable place to work. By offering competitive salaries and attractive benefits, the organization can attract top talent and build a positive workplace culture. Performance-based pay can motivate employees to perform at their best, boosting job satisfaction and engagement. Additionally, promoting work-life balance through flexible arrangements and encouraging employees to take their vacation time can prevent burnout and improve overall morale. In the end, these efforts can create a more cohesive team and support long-term success for the organization.